feat(RFC): Adds altair.datasets #3631

Draft
wants to merge 234 commits into
base: main

dangotbanned
Member

@dangotbanned dangotbanned commented Oct 4, 2024

Related

Tracking

Waiting on the next vega-datasets release.
Once a stable datapackage.json is available, quite a lot of tools/datasets can be simplified or removed.

Discovered a bug that makes some handling of expressions a little less efficient.

Upstreaming some nw.Schema stuff to narwhals

Description

Provides a minimal but up-to-date source for the datasets in https://github.com/vega/vega-datasets.

This PR takes a different approach to that of https://github.com/altair-viz/vega_datasets, notably:

Examples

These all come from the docstrings of:

  • Loader
  • Loader.from_backend
  • Loader.__call__

```py
from altair.datasets import Loader

load = Loader.from_backend("polars")
>>> load
Loader[polars]

cars = load("cars")

>>> type(cars)
polars.dataframe.frame.DataFrame
```

```py
load = Loader.from_backend("pandas")
cars = load("cars")

>>> type(cars)
pandas.core.frame.DataFrame
```

```py
load = Loader.from_backend("pandas[pyarrow]")
cars = load("cars")

>>> type(cars)
pandas.core.frame.DataFrame

>>> cars.dtypes
Name                       string[pyarrow]
Miles_per_Gallon           double[pyarrow]
Cylinders                   int64[pyarrow]
Displacement               double[pyarrow]
Horsepower                  int64[pyarrow]
Weight_in_lbs               int64[pyarrow]
Acceleration               double[pyarrow]
Year                timestamp[ns][pyarrow]
Origin                     string[pyarrow]
dtype: object
```

```py
load = Loader.from_backend("pandas")
source = load("stocks")

>>> source.columns
Index(['symbol', 'date', 'price'], dtype='object')
```

```py
load = Loader.from_backend("pyarrow")
source = load("stocks")

>>> source.column_names
['symbol', 'date', 'price']
```
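
The examples above share one pattern: a backend name selects how a dataset is read, and the resulting `load` callable returns that backend's native type. A minimal stand-alone sketch of that pattern follows. It is not the implementation in this PR: `MiniLoader`, the `_READERS` registry, and the in-memory `_DATA` table are hypothetical names used only for illustration, the rows are placeholders rather than real `stocks` data, and plain Python containers stand in for polars/pandas/pyarrow frames.

```py
from typing import Any, Callable

# Placeholder rows (NOT real vega-datasets values); the real loader reads remote files.
_DATA: dict[str, list[dict[str, Any]]] = {
    "stocks": [
        {"symbol": "AAA", "date": "2000-01-01", "price": 1.0},
        {"symbol": "AAA", "date": "2000-02-01", "price": 2.0},
    ]
}


def _as_rows(rows: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Hypothetical 'rows' backend: a list of row dicts."""
    return [dict(row) for row in rows]


def _as_columns(rows: list[dict[str, Any]]) -> dict[str, list[Any]]:
    """Hypothetical 'columns' backend: a dict of column lists."""
    names = list(rows[0]) if rows else []
    return {name: [row[name] for row in rows] for name in names}


# Registry mapping a backend name to the function that builds its native type.
_READERS: dict[str, Callable[[list[dict[str, Any]]], Any]] = {
    "rows": _as_rows,
    "columns": _as_columns,
}


class MiniLoader:
    """Illustrative stand-in for a backend-parameterised loader."""

    def __init__(self, backend_name: str, reader: Callable[..., Any]) -> None:
        self._backend_name = backend_name
        self._reader = reader

    @classmethod
    def from_backend(cls, backend_name: str) -> "MiniLoader":
        try:
            reader = _READERS[backend_name]
        except KeyError:
            msg = f"Unknown backend {backend_name!r}; expected one of {sorted(_READERS)}"
            raise ValueError(msg) from None
        return cls(backend_name, reader)

    def __call__(self, name: str) -> Any:
        return self._reader(_DATA[name])

    def __repr__(self) -> str:
        return f"{type(self).__name__}[{self._backend_name}]"


load = MiniLoader.from_backend("columns")
source = load("stocks")
print(load)          # MiniLoader[columns]
print(list(source))  # ['symbol', 'date', 'price']
```

Keeping the backend choice in the constructor means call sites such as `load("stocks")` stay identical regardless of which dataframe library is installed.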

Not required for these requests, but may be helpful to avoid limits
As an example, for comparing against the most recent I've added the 5 most recent
- Basic mechanism for discovering new versions
- Tries to minimise number of and total size of requests
Experimenting with querying the url cache w/ expressions
- `metadata_full.parquet` stores **all known** file metadata
- `GitHub.refresh()` to maintain integrity in a safe manner
- Roughly 3000 rows
- Single release: **9kb** vs 46 releases: **21kb**
- Still undecided exactly how this functionality should work
- Need to resolve `npm` tags != `gh` tags issue as well
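
One way to picture the version-discovery notes above is to compare the tags already recorded locally against the tags a registry reports, after normalising spelling differences between sources. The sketch below assumes the mismatch is only a leading `v` prefix; the commit notes do not say what the actual `npm` vs `gh` difference is. `normalize_tag` and `new_versions` are hypothetical helpers, not functions from this PR, and the tag lists are made up for illustration.

```py
def normalize_tag(tag: str) -> tuple[int, ...]:
    """Parse 'v2.9.0' or '2.9.0' into a comparable tuple such as (2, 9, 0)."""
    bare = tag[1:] if tag.startswith("v") else tag
    return tuple(int(part) for part in bare.split("."))


def new_versions(known: list[str], remote: list[str]) -> list[tuple[int, ...]]:
    """Return versions present in `remote` but not in `known`, newest first."""
    seen = {normalize_tag(tag) for tag in known}
    fresh = {normalize_tag(tag) for tag in remote} - seen
    return sorted(fresh, reverse=True)


# Illustrative tags only (assumed formats: 'v'-prefixed locally, bare remotely).
known = ["v2.8.0", "v2.9.0"]
remote = ["2.8.0", "2.9.0", "2.10.0", "2.11.0"]
print(new_versions(known, remote))  # [(2, 11, 0), (2, 10, 0)]
```

Comparing integer tuples rather than strings avoids ordering `2.10.0` before `2.9.0`.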
- Shorter names `Read`, `Scan`
- The single unique method is now `into_scan`
- There was no real need to have concrete classes when they behave the same as parent
Resolves:
```py
File ../altair/.venv/Lib/site-packages/pyarrow/csv.pyx:1258, in pyarrow._csv.read_csv()
TypeError: Cannot convert dict to pyarrow._csv.ParseOptions
```
Also simplifies and removes outdated `Extension`-related tooling
- Reduced the scope a bit, now just un/supported
- Added `pprint` option
- Finished docs, including example pointing to use `url(...)`
Comment on lines +294 to +304
```py
    # TODO: Open an issue in ``narwhals`` to try and get a public api for type conversion
    def schema_pyarrow(self, name: _Dataset, /):
        schema = self.schema(name)
        if schema:
            from narwhals._arrow.utils import narwhals_to_native_dtype
            from narwhals.utils import Version

            m = {k: narwhals_to_native_dtype(v, Version.V1) for k, v in schema.items()}
        else:
            m = {}
        return nw.dependencies.get_pyarrow().schema(m)
```
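
The general shape of the method above (map each column's abstract dtype to a backend-native type, and fall back to an empty schema when none is known) can be illustrated without any third-party dependency. This is only a sketch: `_TO_NATIVE` and `to_native_schema` are hypothetical, the dtype names are placeholder strings rather than narwhals dtypes, and Python builtins stand in for pyarrow types.

```py
import datetime as dt
from typing import Mapping, Optional

# Hypothetical lookup table: abstract dtype name -> "native" type (builtins as stand-ins).
_TO_NATIVE: dict[str, type] = {
    "Int64": int,
    "Float64": float,
    "String": str,
    "Date": dt.date,
}


def to_native_schema(schema: Optional[Mapping[str, str]]) -> dict[str, type]:
    """Convert column -> abstract dtype into column -> native type.

    An empty or missing schema yields an empty mapping, mirroring the
    ``else: m = {}`` branch in the reviewed snippet.
    """
    if not schema:
        return {}
    return {column: _TO_NATIVE[dtype] for column, dtype in schema.items()}


print(to_native_schema({"symbol": "String", "price": "Float64"}))
print(to_native_schema(None))  # {}
```

A public conversion function upstream would let the lookup live in one place, which is what the TODO in the snippet asks for.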
Member Author

@dangotbanned dangotbanned Jan 31, 2025

- Will be even more useful after merging vega/vega-datasets#663
- Thinking this is a fair tradeoff vs inlining the descriptions into `altair`
  - All the info is available and it is quicker than manually searching the headings in a browser
dangotbanned added a commit that referenced this pull request Feb 5, 2025